[Day 12] K nearest neighbors: Implementation

2023 iThome Ironman Contest

DAY 12

AI & Data

ML From Scratch series, part 12

15th Ironman Contest, machine learning, python

whoami

2023-09-12 11:40:01


Now that we have covered the theory behind K nearest neighbors, today we will implement it on the well-known iris dataset.

Implementation

Import Library

First, we implement it using the scikit-learn library.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

Load Dataset

iris = pd.read_csv("../input/iris/Iris.csv") #Load Data
iris.drop('Id',inplace=True,axis=1) #Drop Id column

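Note that the dataset path must be changed to match where the file is stored on your machine.

After loading, we drop the Id column, which is not needed.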

X = iris.iloc[:,:-1] # Features: every column except the last
y = iris.iloc[:,-1] # Labels: the last column

This defines the features and the labels of the dataset.
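The drop and the two `iloc` slices can be tried on a tiny frame first. This is only a sketch: the three rows and the column names below are made up as a stand-in for the CSV, not taken from the real file.

```python
import pandas as pd

# Hypothetical three-row stand-in for Iris.csv; the column names are assumptions.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "SepalLengthCm": [5.1, 7.0, 6.3],
    "PetalLengthCm": [1.4, 4.7, 6.0],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Drop the identifier column, which carries no information about the class.
toy = toy.drop(columns="Id")

# Every column except the last holds features; the last column holds the labels.
X_toy = toy.iloc[:, :-1]
y_toy = toy.iloc[:, -1]

print(X_toy.shape)     # (3, 2)
print(y_toy.tolist())  # ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
```

Slicing by position like this only works when the label really is the last column, so it is worth checking the column order of your own file.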

Divide the data into train and test

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.2, random_state=42) # Split the data into training and test sets
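Conceptually, the split shuffles the rows and holds out a fraction of them for testing. The sketch below illustrates that idea with NumPy only. The function `simple_train_test_split` is a hypothetical helper, not scikit-learn's implementation, so the rows it selects for a given seed will differ from those chosen by `train_test_split`.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=42):
    """Shuffle the row indices, then hold out a `test_size` fraction as the test set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(np.ceil(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index X and y with the same indices so features and labels stay aligned.
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: row i of X_demo is [2*i, 2*i + 1] and its label is i.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Fixing the seed, like `random_state=42` in the call above, makes the split reproducible between runs.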

Sklearn Implementation

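from sklearn.neighbors import KNeighborsClassifier
skmodel = KNeighborsClassifier(n_neighbors=7)
skmodel.fit(X_train, y_train)

# compute_accuracy is not defined at this point in the post;
# it is assumed here to be the fraction of correct predictions
def compute_accuracy(y_true, y_pred):
    return np.mean(y_true == y_pred)

sk_predictions = skmodel.predict(X_test)
sk_accuracy = compute_accuracy(y_test, sk_predictions)
print(f"sklearn-model got accuracy score of: {sk_accuracy}")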

Here k is set to 7.

With this setting, the KNN classifier reaches an accuracy of 96.67% on the test set.
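One way to read this figure: assuming the standard 150-row iris data, a 20% test split holds 30 samples, and 96.67% corresponds to 29 of 30 correct predictions. The sketch below checks that arithmetic. The helper name `accuracy_sketch` and the labels are made up for illustration and are not the post's actual predictions.

```python
import numpy as np

def accuracy_sketch(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Assumption: 150 samples in total, so a 20% test split holds 30 samples.
n_test = round(150 * 0.2)

# Made-up labels: 30 test samples with exactly one misclassification.
y_true = np.array(["setosa"] * 10 + ["versicolor"] * 10 + ["virginica"] * 10)
y_pred = y_true.copy()
y_pred[15] = "virginica"  # one versicolor sample predicted as virginica

acc = accuracy_sketch(y_true, y_pred)
print(n_test)        # 30
print(f"{acc:.2%}")  # 96.67%
```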

KNN from Scratch

class KNN:
    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors
        
    def euclidean_distance(self, x1, x2):
        return np.linalg.norm(x1 - x2)

    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    def predict(self, X):
        # Create an empty list to store the predictions
        predictions = []
        # Loop over X examples
        for x in X:
            # Get prediction using the prediction helper function
            prediction = self._predict(x)
            # Append the prediction to the predictions list
            predictions.append(prediction)
        return np.array(predictions)

    def _predict(self, x):
        # Create an empty list to store the distances
        distances = []
        # Compute the distance between x and every training example
        for x_train in self.X_train:
            distance = self.euclidean_distance(x, x_train)
            distances.append(distance)
        distances = np.array(distances)
        
        # Sort the distances in ascending order and keep the indices of the n nearest neighbors
        n_neighbors_idxs = np.argsort(distances)[: self.n_neighbors]

        # Get the labels of the n nearest neighbors
        labels = self.y_train[n_neighbors_idxs]
        labels = list(labels)
        # Return the most frequent label among them (majority vote)
        most_occuring_value = max(labels, key=labels.count)
        return most_occuring_value
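
The same three steps (compute distances, take the k nearest, vote) can also be written without the inner Python loop. The following is a minimal sketch that assumes numeric features which fit in memory as a full distance matrix. The function `knn_predict` and the toy data are illustrative and not part of the class above.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, X_query, n_neighbors=3):
    """Predict a label for each query row by majority vote among its nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    y_train = np.asarray(y_train)

    # Pairwise Euclidean distances via broadcasting, shape (n_query, n_train).
    diffs = X_query[:, None, :] - X_train[None, :, :]
    distances = np.linalg.norm(diffs, axis=2)

    # Indices of the n_neighbors closest training points for each query row.
    nearest = np.argsort(distances, axis=1)[:, :n_neighbors]

    # Majority vote over the neighbor labels.
    return np.array([Counter(y_train[row]).most_common(1)[0][0] for row in nearest])

# Toy data: two well-separated clusters labeled "a" and "b".
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_toy = np.array(["a", "a", "a", "b", "b", "b"])

preds = knn_predict(X_toy, y_toy, [[0.05, 0.05], [5.0, 5.1]], n_neighbors=3)
print(preds)  # ['a' 'b']
```

On ties, `Counter.most_common` returns the label seen first among the neighbors, which is the same tie-breaking as `max(labels, key=labels.count)` in the class.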